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Abstract 

Gaussian process (GP) regression is a powerful tool for building predictive models 
for spatial systems. However, it does not scale efficiently for large datasets. Particu¬ 
larly, for high-dimensional spatial datasets, i.e., spatial datasets that contain exogenous 
variables, the performance of GP regression further deteriorates. This paper presents 
the Sparse Pseudo-input Local Kriging (SPLK) which approximates the full GP for 
spatial datasets with exogenous variables. SPLK employs orthogonal cuts which de¬ 
compose the domain into smaller subdomains and then applies a sparse approximation 
of the full GP in each subdomain. We obtain the continuity of the global predictor 
by imposing continuity constraints on the boundaries of the neighboring subdomains. 

The domain decomposition scheme applies independent covariance structures in each 
region, and as a result, SPLK captures heterogeneous covariance structures. SPLK 
achieves computational efficiency by utilizing sparse approximation in each subdomain 
which enables SPLK to accommodate large subdomains that contain many data points 
and possess a homogenous covariance structure. We Apply the proposed method to 
real and simulated datasets. We conclude that the combination of orthogonal cuts 
and sparse approximation makes the proposed method an efficient algorithm for high¬ 
dimensional large spatial datasets. 

Keywords: Gaussian process regression, Local Kriging, Sparse approximation, Spatial 
datasets 


1 Introduction 


Recent advances in data collection technologies in Geostatistics, such as ubiquitous satel¬ 
lites around the planet, have created an unprecedented opportunity for utilizing data-driven 
predictive models. Particularly, Kriging (Cressie, 1990), also known as Gaussian Process 
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(GP) regression (Rasmussen and Williams, 2006]), can theoretically benefit from very large 
datasets since it is a non-parametric model whose flexibility and performance generally 
increase by having more data points (Hastie et ah, 2009). However, the computational 


complexity hinders efficient application of GP regression to large datasets. Specifically, 
GP regression is dominated by the inversion of covariance matrices which is of 0(7V 3 ), 
where N is the number of data points. A significant body of research in GP regression 
deals with constructing an approximated covariance matrix, which is less costly to invert. 
These approximation methods either result in sparse or low-ranked covariance matrices, 


which reduce the cost of matrix inversion (see Silverman (1985); Snelson and Ghahramani 


(2006); Tresp (2000); [Du et al. (2009)). 

The fact that most spatial data are non-stationary presents still another challenge, i.e., 
the covariance structure changes for different locations in the space, primarily when the 


behavior of the response variable strongly depends on the underlying geology (Kim et al. 


2005). Thus, in the context of spatial data, GP regression needs to efficiently address 


the large number of data points while effectively capturing the inhomogeneous covariance 
structure. 

One idea that empowers GP to tackle inhomogeneous data structures is through the 
class of local Kriging that assumes distinct covariance functions for each region of the 
domain (Haas, 1990). Based on this idea, we can decompose the domain into smaller 


subdomains and apply local GP in each subdomain. As such, local Kriging breaks down 
the global covariance matrix to smaller local matrices, and reduces the total computational 
complexity to 0(Nn 2 ), where n is the number of local data points. This idea, however, also 
presents a challenge: the discontinuities in prediction on boundaries of the subdomains. 
The Domain Decomposition Method (DDM) is one method in the class Local Kriging 
that guarantees the continuity of GP on boundaries by defining a few designated control 
points on each boundary and imposing equality of neighboring local predictors at these 


points (Park et al., 2011). The superiority of DDM over similar methods such as weighted 

average techniques ( 

Urtasun and Darrell, 

2008; Rasmussen and Ghahramani, 2002; 

Gra- 

macy and Lee 

2008 

) emerges as a result of effectively handling the boundary conditions, 


although the exponential growth in the number of control points as the dimension of the 
domain increases hinders DDM’s ability to handle dimensional domains greater than two. 
This is a major drawback since some spatial datasets contain not only the location infor¬ 
mation as inputs, but also other geographical/meteorological information which results in 
a dataset with higher dimensional inputs. 

Motivated by these challenges, we propose a new method for local Kriging which pre¬ 
serves the continuity of GP in higher dimensional spaces. The new method employs a 
flexible policy to partition domains, thus minimizing the number of boundaries for the 
subdomains. Such partitioning may end in large subdomains that contain a large number 
of data points, which makes the application of a full GP impractical in each subdomain. In 
order to obviate this obstacle, we utilize methods of sparse approximation for each region. 
The computational complexity of this algorithm is 0(7Vm 2 ), where m is the number of 
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pseudo-data points in each subdomain, i.e., the data points that help to achieve sparse 
approximation. The major contribution of our method is that it preserves the continuity 
of GP while it handles non-stationarity for high dimensional spatial datasets. 

The remainder of the paper is organized as follows. Section [2] briefly introduces GP 
regression, along with SPGP and DDM. Section [3] presents the proposed method. Section[4] 
compares the proposed method with competing algorithms. Section [5] concludes the paper 
and suggests future research. 


2 Gaussian Process Regression 


Given a probability space (£1, J 7 , V), and an index Set T, where ft is the sample space, J 7 is 
a sigma-algebra, and V is a probability measure, a Gaussian stochastic process, or simply 
a Gaussian process (GP), is a stochastic process where for any T' as a finite subset of T, 


the random vector {f t : t E T 7 } follows a multivariate normal distribution (Rasmussen and 
Williams, 2006), where ft is a measurable function from ft to R for a given t. Here, we 


are not interested in GPs that model the time evolution of a process, i.e., when the index 


set T is simply time (Wiener, 1949), but instead we focus on statistical learning (Hastie 
|et al.[|2009| ). Specifically, we consider the index set to be a subset of R p . Then, for a given 
{xi, X 2 ,..., x^v} E R p , the random vector {/i, / 2 ,..., /w} follows a multivariate normal 
distribution, where ft — /(x^), for i = 1,..., N. 

We call D — {(x^,/)) : i = 1,..,7V} a training dataset, and the goal is to infer /(x*) 
for a given x* G R p , based on the information in D and the GP assumption. That is, 
assuming f ~ A/"(/x,K), where f = [/i, / 2 , • • •, /w] T > and /x and K are the mean vector 
and covariance matrix, respectively, we seek p(/(x*)|f), i.e., the conditional distribution 
of /(x*), given f. We note that the mean vector fi = E{f} and the covariance matrix 
K = E {(f — /x)(f — /x) T }, where E{-} denotes the expectation operator. If the covariance 
between /(x^) and /(xj) is determined solely by the values of their inputs, x^ and Xj, we 
can denote the covariance by fc(x^,Xj) = E{(/) — — Mj)}, where i±i — E {fi}, and 

fij — E{/j}, the ith and jth elements of /x, respectively. For fc(-, •) to be a valid covariance 
function, it must be positive definite, i.e., 


M x > x ') f(-x.)f(-x')v(dx)v(dx') > 0, 


( 1 ) 


for any /(•) € L 2 (M P , u) where u is a measure. In fact, a GP is uniquely determined by its 
mean function E{/(x)} and covariance function fc(x, x 7 ) (Rasmussen and Williams, 2006). 
The covariance function plays an important role since it defines the notion of similarity be¬ 
tween different points on the domain and determines the sample path properties of the GP. 
Furthermore, the form of the covariance function specifies the computational complexity 
of the GP regression, which we discuss next. 

In most practical settings, the training dataset consists of noise-contaminated observa¬ 
tions yi, i.e., yi = /(x^ + e^, where e ~ A/*(0, a 2 ). Specifically, we have D = {(x^, yf) : i = 


(Rasmussen and Williams, 2006 
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1,.., TV} and seek p(/(x*)|y), the predictive distribution at x*, where y = [yi, i/2> • • •, Vn ] T • 
When yi G M, for i = 1, 2,..., TV, we call the problem of estimating the predictive distri¬ 
bution p(/(x*)|y), given the noisy observations y, the GP regression. 

There are equivalent ways to present the inference based on the GP regression, for 


example through regularized optimization (Poggio and Girosi, 1990), optimal spatial pre¬ 


diction (Cressie and Wikle, 2011), or an extension of Bayesian linear regression. In this 


paper, we adopt a Bayesian treatment from Rasmussen and Williams (2006), where we 


specify a prior for f and then calculate the predictive distribution at x*. Given a train¬ 
ing dataset D , assume the prior distribution f ~ A/"(0,K), where K^ = fc(x^,Xj) for 
i, j G {1,... ,7V}. This implies y|f ^ 7V(f, cr 2 1), where I is the TV x TV identity matrix. To 
find the marginal distribution of y, we can integrate out /, i.e., 


p(y) = / p(y| f M f )df, 


( 2 ) 


which results in another normal distribution, namely y ~ M{ 0, K + <j 2 I). In other words, 
we have a new GP whose covariance function is fc(x, x') + cr 2 <5(x. x'), where 5(-,-) is the 
Kronecker delta. 

Therefore, based on the joint normality assumption, 


(y, /*) ~ -V (o, 


K + (J 2 1 

. K L 


K d * 

k 


where /* = /(x*), K^* is an N x 1 vector of covariances between the inputs in D and x*, 
and k ** is the variance at x *. Hence, 


(/*|y) ~ V (K^(K + a 2 l)~ l y, k,, - K^(K + a 2 !)- 1 ^*) , (3) 


i.e., the predictive distribution of / at x* is normal whose mean is a linear combination of 
the observations. The predictive variance is smaller than the prior variance k ** by a factor 
determined by the covariance between the test point x* and the training points. We are 
interested in the fact that the mean of the predictive distribution is a linear combination 
of the observation in the training dataset. Among all the unbiased linear predictors for the 
mean of the predictive distribution, i.e., /i(x*) = u(x*) T y, the mean in Q has the smallest 
expected prediction error, E{(/2(x*) — /*) 2 }. 

Note that implementing the GP regression entails inverting matrices of size TV, where TV 
is the number of training data points. Hence, the computational complexity is of (9(TV 3 ), 
which can be too slow for many practical applications, in particular, spatial statistics. 
Approximating the covariance matrix as a method to reduce the computational cost has 


been studied by ( 

Snelson and Ghahramani, 2006; 

Smola and Bartlett, 

200l[ Williams and 

Seeger 

2001; 

Lawrence et ah, 2003 

Pourhabib et ah, 2014; 

Park et ah, 

2011 

). In this 


paper, we present a method that utilizes a covariance structure, which effectively captures 
the non-stationarity in the data and, at the same time, is computationally efficient. Before 
presenting the proposed method, we discuss two existing methods in more detail. 
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2.1 Sparse Pseudo-input Gaussian Process 


Sparse Pseudo-input Gaussian Process (SPGP) is a low-ranked covariance approximation 
method that uses a set of m unobserved pseudo-data points D — {x^, fi : i = l,..,ra}, 
where m <C N. The set D denotes realization of another GP with the same covariance 
function as the underlying GP, /. Unlike the training dataset D , the pseudo-data points in 
D are not actual observations, but they are a new set of parameters which SPGP utilizes 
to approximate the likelihood of GP (Snelson and Ghahramani, 2006). Specifically, instead 
of using the existing data points in D to build a Nystrom approximation of the covariance 
matrix (Williams and Seeger, 2001), SPGP uses D to build 


Q d,d — 


(4) 


where is the covariance matrix at the pseudo-data locations and K D jj is the covariance 
matrix between the data points in D and D. To achieve a computationally efficient GP, 
assume that the observations y are conditionally independent, given the pseudo-outputs 

f = 


jm]T■, i.e., 


N 


N 


p(y|f) = IP«) = IU Kx^kV f, k. 


k XiS kVk Sxi + ct 2 


(5) 


2=1 


2=1 


= V(K DS Kyf, diag(K D - Q dD ) + a 2 1). 


Then, putting a prior distribution over f as p( f) ^ ^(0, K^) and integrating it out yields 

p( y) = Jp{y\f)p{f)df ~ V( 0 , Qjv + diag(K D - Q D ) + a 2 1). (6) 

Therefore, we obtain the predictive distribution by conditioning the prior distribution on 
y as follows: 

M/*|y) = Q d* (Q D + diag(K D - Q d S )cj 2 I) _1 y, (7) 

^ 2 (f*\y) = k** ~ (Q D d + diag(K D - Q D D )a 2 l) 1 Qd*, (8) 

where Q^* is the low-ranked covariance matrix Q in which we use the test point x* instead 
of D. 

Calculating 0 and ([8|) still requires inverting the matrices of size N x TV; however, 


by taking advantage of the matrix inversion lemma (Rasmussen and Williams, 2006) the 
computational complexity reduces to 0(Nm 2 ). Moreover, when pseudo-data points are 
considered as parameters of the model, their locations are learnt simultaneously along 
with the hyperparameters of the covariance function through maximizing the marginal 
likelihood, which is constructed based on the low-ranked covariance Q. 

SPGP can be considered as a GP with a particular covariance function 


K s GjP ( X , X ') = Q(x,x 7 ) + S(x,x)(K(x,x') - Q(x,x)), 


(9) 
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where S(x,z) is the Kronecker delta, and Q(x,x / ) = KxdK^K^x/ (Snelson 


In 
dis¬ 
cussed inlQuinonero-Candela and Rasmussenl (120051). Despite improving the computational 


addition, we can obtain SPGP as an extension of the sparse approximation framewor' 


2007). ] 

3work di 


complexity of the full GP, SPGP’s performance depends on the number of pseudo-data 
points used, which in general should be large enough to obtain good prediction results. 
Moreover, SPGP cannot effectively handle the non-stationarity in the data. The latter is¬ 
sue can be dealt by selecting a non-stationary covariance matrix if (x, x 7 ); however, learning 
hyperparameters of non-stationary covariance functions is not as straightforward as that 
for stationary ones. 


2.2 Domain Decomposition Method 

Local Kriging refers to a class of methods that treats the challenges of GP application to 
spatial data, i.e., handling non-stationarity and dealing with large datasets, in a unifying 
framework. The idea that pairs of data points correlate less as the distance between 
the points increases, suggests accounting only for the neighboring data points. In other 
words, we decompose the domain into smaller pieces, or subdomains, in order to use local 
predictors, i.e., a GP trained only on the data in a specific subdomain. We can train these 
local predictors independently with respect to the local data points, and as a result handle 
the non-stationarity, because we assume differnet covariance functions for each subdomain. 
Moreover, dealing with local covariance matrices instead of a global covariance matrix, 
which we can invert more efficiently, reduces the computational complexity to 0(Nn 2 ), 
where n is the number of data points in each subdomain. Note that this reduction in 
training time by considering neighboring points is advantageous in real time learning where 
decision making in short intervals is crucial (Nguyen-Tuong et ah, 2009). We call these 


methods, which do not impose any continuity constraint on the boundaries, naive local 
Kriging. Note that using naive local Kriging methods deteriorates the prediction accuracy 
for test points close to boundaries, since neighboring local models are not continuous on 
their joint boundary as a result of independent training on their respective subdomains. 
Various weighted average techniques have tried to alleviate the discontinuity problem and 
improve Local Kriging methods by adhering together the local predictors on boundaries, 


such as Bayesian committee machine (Tresp, 2000), local probabilistic regression (Urtasun 


and Darrell, 2008), mixture of Gaussian process experts (Rasmussen and Ghahramani 


2002), and the treed Gaussian process model (Gramacy and Lee, 2008). Although weighted 


average techniques mitigate the boundary condition problem, they cannot speed up the 
original algorithm as efficiently as naive local Kriging methods, due to the computational 
burden exerted on the algorithm. 


Domain Decomposition Method (DDM) (Park et ah, 2011), a new method in local 


Kriging, treats the discontinuities in prediction on the boundaries of subdomains with less 
computational burden. Noting that the mean of the GP posterior function is the optimal 
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( 10 ) 


(unbiased) linear predictor, consider as an unbiased linear predictor for /*, 

//(a;*) = u(a;*)V, 

where E(/x(cc*)) = /j(:c), i.e., unbiasedness, and u(a;*) £ M iV . Then define the optimization 
problem which finds a vector u(x*) that minimizes the mean squared prediction error, 

E {//(£*) - /*} 2 = u(x*) t (cr 2 I + Kd)^*) - 2u(:r*) t K jC >* + K *, (11) 


where I is the identity matrix. We can show that the solution for u(x*) has the form of 
AK^, where A is a N x TV matrix. Therefore we can obtain the best linear predictor by 
solving the following minimization problem with respect to A: 

min K^*AV 2 I + K d )AK d * - 2K t D *A t K D *. (12) 

A 


Solving optimization (12) is still in the order of 0(N 3 ). To overcome this problem, 
DDM decomposes the domain of the data into S smaller subdomains Qj for j = 1,..., S', 
and tries to solve the S local optimization problems as follows: 


min K], A t j (o?I + K D .)A j K D ..-2K t D A$K D .. VSl j bx.€Sl j . (13) 

A n 


To address the discontinuity problem at the domain boundaries, DDM imposes conti¬ 
nuity constraints on the optimization problem. Specifically, define Tjk = flj D as the 
boundary of two neighboring subdomains fand and let bjk denote the data points on 
this boundary. DDM imposes that the two local predictors defined on Qj and must be 
equal at these points, specifically 


K‘ DAt A‘y,. = KS,^,. (14) 

In order to systematically insert constraint into the local optimizers, DDM locates 
q control points on each boundary and creates a polynomial function of a certain degree 
p where (p > q), which interpolates these q control points. To obtain such a polynomial 
function, we utilize p Lagrange basis polynomials. Therefore, the local problems will be 


min A t j (a‘jI + K Dj )A j K Dj ^-2K t D A^Kc^ x* £ ilj 

A 3 

subject to ’K t D . b . k A t -y j = T* fc r jfc k £ neighborhood^'), 


VDj Sz x* G Qj 

(15) 


where T ^ is the matrix of p Lagrange basis polynomials at bjk and is the evaluation of 
the function in q control points on the boundary. By solving (15), the local mean predictor 
becomes 


= Ky * (a] I+ K d .) 1 y j + cj , 


(16) 
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where c j is a correction term. Intuitively, we interpret the local mean predictor as the 
sum of two terms. The first term is the standard local GP predictor and the second term 
appears because of the continuity constraint, which approaches to zero as the test data 
points move away from boundaries. See Park et al. (2011) for the derivation of c j and 
specific implementation details. 

DDM uses rectangular meshing, i.e., it partitions the domain into smaller rectangular 
subdomains, and assumes each boundary is shared by at most two subdomains. As such, 
it can effectively be applied only to two-dimensional spatial datasets. Furthermore, since 
DDM utilizes a full GP in each subdomain, the size of subdomians should be small enough 
so that the overall procedure runs efficiently. 


3 Sparse Pseudo-input Local Kriging 


We propose a GP approximation method which can effectively address the non-stationarity 
in large spatial datasets. The task is contingent upon an efficient partitioning of the domain 
such that in each region the data is stationary and fast GP predictors can be applied 
to each subdomain. In addition, the local predictors need to seamlessly connect on the 
boundaries. Section 3T presents the partitioning scheme which enables us to partition 
domains in larger dimensional space, i.e., for d > 2, and Section T2 presents the formulas 
for mean and variance predictions. We discuss specific details about implementation in 
Sections 13.31 and on 


3.1 Orthogonal Cuts in the Domain 


Recall D = {(xj, yi) : i = 1 , for x,; G X C R d denotes the training dataset. Assume 

A is a d-dimensional box whose boundaries are either orthogonal or parallel to each other, 
i.e., A is a d-orthope or hyper-rectangle (Coxeter, 1973), which is a generalization of 
the notion of a rectangle to higher dimensional spaces. Koppen (2000) shows that the 
boundaries of a hyper-rectangle that resides in a d-dimensional space are lower dimensional 
orthopes, and the number of such m-orthopes are 


N d = 2 

m 


d—m 


d 


m 


(17) 


for m = 0,..., d — 1, where each (N^/d) of m-orthopes are parallel to each other. We 
denote m — 0 as vertex, mn — 1 as edge, and m — d — 1 as the face of the d-dimensional 
orthopes. 

Also, we define an orthogonal cut on a d—orthope domain as a hyperplane, i.e., a (d—1)- 
dimensional linear space embedded in a d-dimensional space, which is either orthogonal 
or parallel to each boundary; consequently, each d-orthope can be cut at most from d 
directions. The proof of the recent claim is straightforward, since if we move the domain 




















to the origin of the space and rotate it so that all the boundaries become orthogonal to 
one of the axis of the space, the above statement is equivalent to cutting the d-orthope 
with respect to each axis, which is possible in d ways. Figure [l] illustrates an example of 
orthogonal and non-orthogonal hyperplanes cutting a three-orthope domain. 



(a) Non-orthogonal Cuts (b) Orthogonal Cut 


Figure 1 : Cutting a three-orthope by two-dimensional hyperplanes 


Recall that after cutting the domain of the training data and applying a GP in each 
subdomain, we need to connect two neighboring local predictors on their joint boundary, 
i.e., the intersection of the cut and the domain. Define a few control points on the bound¬ 
aries and stitching the local predictors in these points together. Note that the number of 
control points cannot be arbitrarily large, since the additional computations reduce the 
efficiency of the algorithm; consequently, we need to control the growth of the number of 
boundary points. Denote the total number of control points in the whole domain as T(q), 
number of subdomains as S', number of boundaries of each subdomain as V and number of 
control points on each boundary as q. T(q ) is proportional to these three factors, i.e., 


T(q) oc qST. 


(18) 


The growth of T(q) is dominated by the exponential growth of < 7 , since the boundary 
spaces expand with an exponential rate and more control points are required for higher 
dimensional spaces. Note that this is not a critical issue for most spatial datasets, since 
the dimension of the dataset is generally smaller than 6 . Therefore, controlling the other 
two factors curbs the growth of T(q). 

Simple orthogonal cutting may result in very large subdomain, where a full GP is 
inefficient. However, as Section |3.2| discusses, we fit a sparse GP for each subdomains. 


This enables us to expand the size of subdomains, and reduce S to as little as two by 


halving the domain. See section [473] for several considerations to tune S to an appropriate 
value. 

T is another parameter we can control by introducing the following meshing policy: 
Under the stated cutting condition, i.e., orthogonal cuts, a d-orthope domain can be de- 
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composed into S smaller components, each of which is a smaller d-orthope and adjacent to 
other d-orthopes in its faces. In fact, each face represents a neighbor in this terminology, 
and according to © a d-orthope has at most 2d faces and subsequently, 2d neighbors. 
Consider that each 2d/d = 2 of these faces are parallel and will appear because of the cuts 
which are parallel to them, or in other words, cutting from each direction creates at most 
2 neighbors for a subdomain. If all the potential neighbors of subdomians are desired, cut 
the domain from all d directions; consequently, T is a function of the directions of the cuts, 
which is minimized when all cuts are parallel. Note that this policy is not efficient for the 
naive local Kriging, since it causes the width of subdomains to narrow in order to keep 
them small. On the contrary, using a sparse GP in each subdomain will accommodate the 
large size subdomains. In this manner, regardless of the dimension of the space, if we slice 
the domain with c parallel cuts we will encounter S = c+ 1 subdomains, each of which has 
at most two neighbors and the total number of boundaries is c. Figure [2a| shows a three- 
orthope which is cut from all three directions, and as a result some sub-three-orthopes 
have six neighbors, whereas Figure [2b| shows the current meshing policy with at most two 
neighbors for each sub-three-orthope. As such, the total number of control points is equal 
to 


T(q) = cq. 


(19) 





(a) Cutting a three-orthope from all direc- (b) Cutting a three-orthope with parallel 
tions cuts (The current meshing policy) 


Figure 2: Two different meshing policies for a three-orthope 


3.2 Mean and Variance Prediction 


The partitioning scheme presented in Section |3.1| may lead to subdomains containing a 
large number of training data points, which makes the application of a full GP inefficient. 
We propose to utilize a sparse covariance function which will speed up model fitting in 
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each subdomain and make the overall algorithm efficient. Recall that the joint distribution 
of observed variables and the prediction function under the SPGP assumptions is 

Q d + diag(K D - Q D ) + a 2 I Q D * 

Q *d K* 

Based on equation the mean of GP is the best unbiased estimator u(x*) t y. Under 
the covariance structure imposed in ( [20| ), we have 

E{/i(z*) - /*} 2 u(x*Y(Q d + diag(K D - Q D ) + a 2 I)u(x *) - 2u(x*YQd* + A*. (21) 


p(yj*) o, 



Note that by replacing u(x*) = (Q^ + diag( Kjj — Q d) + cr 2 I) _1 Q£>*, equation ( |2l| ) is 
equivalent to the variance of SPGP predictive distribution, which is indeed the minimum 
value of (21); consequently, u(x*) is restricted to the form of AQ^*, where A G R NxN , 
and the global minimizer emerges as 


arg min Q t D ^A t (Q d + diag( — Q^) + a 2 T)AQ]j* — 2Q t D *A t QD*. (22) 

A 


In order to take advantage of Local Kriging, we decompose the domain into S subdo¬ 
mains Qj for j = 1, ..,5, where each Qj has mj local pseudo-data points. Unlike SPGP 
the number of pseudo-data points is no longer a global parameter, but it differs from one 
subdomain to another (see Section (3.4)), and as a result, the local low-ranked covariance 
matrix with respect to mj is 


Qu 


K,„, K 'K 


(23) 


By decomposing the domain, the global optimizer breaks down to S smaller local op¬ 
timizers: 


min 

A, 


QW 


A*(Qj Dj + diag(K Dj 


Qj 


Di 


+ <rjl) Aj 


Qj 


Di 


2Qj 


Di 


fA-jQj Dj* 


VOj h x* G flj,(24) 


where Aj G R n i xn o and therefore the local predictors have the form 

Uj(z*) = Qj^*A5-y_j V%. (25) 

Moreover, we define Tj^ as the joint boundary of two neighboring subdomains Qj and 
Qk, and locate q control points on each Tj&, (bj&, rj&), where rj& is a q x 1 vector containing 
the predicted mean of the q locations bj&. To obviate the discontinuity problem, the model 
forces mean predictors of the two subdomains Qj and Q & to be equal to the predicted value 
of control points on Tjk, he., 

Qj Djb jk Ajyj = Y]k bjk G Tj/c- (26) 
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Denote T j as the collection of all boundaries of ftj and integrate all boundary points 
for this subdomain into (bj,rj) = Ur\,- fc er\,- (bj&, r jk)- The boundary conditions for the 
subdomain j becomes 


Qibjbj-^jyj ~ r j foj ’ r i) ^ Tj. (27) 

By adding the above constraint to local optimizers, the local minimization model for 
each Qj appears as 

mm QjD,*A‘(Qj D) + diag(K D . - Qj D .) + cr]l)A j Qj £) ^ - 2Qj^ . A*Qj D ^ 

Aj (28) 
subject to Qj^-bj Aj-yj = r 7 (bj,rj) G r> 

Not that since the acquired model has a nonlinear objective, we need to take advantage 
of nonlinear optimization techniques to obtain the solution. Since our model contains only 
equality constrains, we use Lagrange multipliers to transform the constrained optimization 
into a Lagrangian function, 


A ( A i> A i) = QJ D j * A 'j(.Q3Dj + diag(K Dj 


QjDj) + yi)A/Qj 


Dj* ~ - Qj Dj * A*- Qj Dj * 
-* t j(.Qi t D j b j At jyj - r i)> 


(29) 


where A j is a vector of the Lagrange multipliers having a size equal to the number of all 
boundary points of subdomain j. The tuple (A*, A*) minimizes such a function if it satisfies 
the following system of equations of partial derivatives, 


( jj^ . A ( A n V) — 6, 

(30) 

A A ( A ^) = o. 

(31) 


We do not present the procedure for solving this system of equations since it is tedious 
and similar to the solution of DDM (see Park et al. (2011) for details), instead giving the 
solutions, 


A j — 2G j 

A jQlDj* 


r i - Qjkygjyj 

= + 0.5yjA*k 6j *), 


(32) 

(33) 


where 


H i = (Q-bj + diag(K Dj - Qj D .) + a]l) 1 (34) 

Gj = ({diag 1/2 [(Kl jbj K^)" 1 ]!^} o {Qj^. Qi D . b .diag[(Qj t Djbj Qj^b,) -1 ]}^) 
k bj* = [(QjL Qjb,*)" 172 K 6 .*] o [Qj* D . 6 . Qj^*) -1 ]) (36) 
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where o denotes the entry wise product. Consequently, we obtain the local mean predictor 
by plugging in ( [33] ) into ( |25| ), which results in 

(37) 


fo( x *) = Qjk* H jyj + K L* G i( r i - Qjkb.Hjyj)- 


In fact, the objective function of the minimization model is the local variance predictor 
and therefore by plugging ([33]) into (|24|) the predictive variance becomes 


<7j(a;*) = K* - QjV* H j Qjn 


(38) 


+K 




Qj 




Qikb,- Hj yj )* G* */ (y!-Hj ), 


where K* is a constant matrix that did not appear in the optimization procedure. Note 


that in both (37) and (38) the first term is exactly the predictive mean and variance of 
local SPGP, and the terms coming after it, which are amplified for local points close to 
boundaries, appear to maintain the continuity of the global predictive function. 

Calculating the local mean and variance predictors requires estimating the values of 
the control points, i.e., vector rj. Following Park et al. (2011), we estimate the values so 
that minimize the summation of all local variances with respect to r j. Specifically, 


r = aigmin{ dj-fo)}, 


(39) 


which leads to the following estimator for rj, 

y^jyj Qj D&k'Rjyj + y^y* Qik^ 11 ^ 


*jk = 


ytjH-jyj + yfcH fc y fc 


v(bjfc, Tjk) € Tj nr k , (40) 


where we omit the details of the derivation and refer the reader to Park et al. (2011) 


Intuitively, we can interpret the above estimation for r jj~ as the weighted average of 
two neighboring subdomains Qj and fi*., where each has a weight equal to yjHiyi for 
l = j, k. Note that in the current method each boundary is exactly the intersection of 
two subdomains, whereas for any other meshing policies with more subdomains associated 


with a boundary we need to use the more complicated version of (40), but by doing so, 
the algorithm will become computationally inefficient. 

Writing the above equations means that we must invert the matrices of size rij x n 


'ji 


and obviously it is time consuming for large rij\ However, by using matrix inversion 


lemma (Snelson, 2007), known as Woodbury, Sherman and Morrison formula, the com¬ 
putational complexity reduces from O(n^) to 0(?7yra?), where mj <C rij. Consequently, the 
total computational complexity becomes 0(Srijmn 2 j ), or equivalently 0(Nm 2 ), which means 
that the current algorithm can handle the non-stationary data in a time asymptotically 
equivalent to that of SPGP. We name the proposed method that decomposes the domain 
into smaller subdomains and then utilizes a sparse approximation in each subdomain the 
Sparse Pseudo-input Local Kriging (SPLK). 
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3.3 Hyperparameter learning 


Maximizing the marginal likelihood of the training data, p(y ), is a popular for learning 
the hyperparameters of a model ( Rasmussen and Williams] |2006| ) . In the current method, 
instead of one global marginal likelihood function, we have S local functions p(yj ), each 
of which can be trained independently. This is the key factor in handling non-stationarity 
of the data, since SPLK is able to use appropriate hyperparameters for each subdomain 
depending on the behavior of local data points. Also, recall that our local predictors are in 
fact SPGP predictors that consider pseudo-data points as parameters of the . Therefore, 
we have two types of parameters: one is the location of local pseudo-inputs and the other 
is the hyperparameters of the underlying covariance function. Maximizing the logarithm 
of local SPGP marginal likelihood functions using gradient descent with respect to local 
pseudo-inputs and hyperparameters provides local optimal locations for them. Specifically, 
the log marginal likelihood of each SPLK’s local model j is 


log(p(yj)) = -I log |Aj-| - iyjA . l Yj - ^ log27r, (41) 


where Aj = Qj Dj + diag(K Dj - Qj D .) + ofI. 

Snelson and Ghahramani 

(2006 

) show that 

the cost of finding the derivatives and maximizing (41) is in order of O 


’ + njrnjP ) 


that results in 0(Nm‘j + NrrijP) as the total computational complexity for training step, 
where P is the number of hyperparameters of the covariance function and mj is the number 
of local pseudo-inputs. Note that the training cost is not a function of rij\ consequently, 
the SPLK is flexible to use large size subdomains without loss of efficiency. This is not the 
case for naive local Kriging, which maximizes the local marginal likelihood, 


log(p(yj)) = — log |K j + a] I| - -y j (Kj + of I 


\-l 


Tin 


y j - y log2vr, 


(42) 


since maximizing (42) requires inverting the local covariance matrix which has a cost of 
0(n|). Due to the dependence of training cost to nj naive local Kriging is restricted to 
use small size subdomains, since large covariance matrices are costly to invert. 


3.4 Number of pseudo-inputs and the location of control points 

One important parameter that affects the performance of SPLK is the number of pseudo¬ 
inputs, nrij , located in each subdomain. One factor that determines the value of rrij is the 
stationarity of the data, since locating numerous pseudo-inputs does not improve the pre¬ 
dictive power of SPLK for stationary data structures. In addition, there is always a positive 
correlation between the number of pseudo-inputs and the training time. Consequently, the 
model is constrained to tune this parameter in a way that maximizes the precision while 
controlling the training time. Based on empirical results, setting nrij to a proportion of the 
square root of the number of local training data points, n^, provides good prediction in a 
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relatively short time. As such, we propose the formula 


rrin = 


fc v / n“ 


( 43 ) 


where k is a real number between 0.1 and 4. In fact, we can think of k as a measure of 
stationarity: we can set A; to a small value if the data is stationary, while we need to choose 


a larger value as the non-stationarity increases. See Section [4. 3. 1 | for a sensitivity analysis 
regarding the value of k. 

Another important issue is the location of the control points on each (d — l)-dimensional 
boundaries. Since we stitch two neighboring subdomains in a limited number of control 
points, these points should effectively cover the whole boundary. Therefore, we introduce 
a practical procedure which distributes q given control points over the boundary. 

Consider the way we construct the d-orthope domain, where each edge of the d-orthope 
is orthogonal to one of the d axes of the space, (ax) 1 for i = 1 ... d. Therefore, all the cuts 
under the stated meshing policy are parallel to each other and orthogonal to a specific (ax) 1 . 
Let the set {C l - : j = 1 ,..., c} be all the c parallel cuts, which are orthogonal to {ax) 1 for 
each i = 1 ,..., d. Also, let the set {B l - : j — 1 ,.., c} denote the faces that appear because 
of those cuts. Note that there is a one to one relationship between C l - and B j, and that all 

coordinate in common, since the corresponding C l - is orthogonal 

In order to determine the rest of 
(d— 1 ) coordinates of vertices, define a tuple, r^, which contains the maximum and minimum 
values of the k th dimension of the training data, i.e., = (r£, 7 ~J), for k = 1 ,..., d, k 7 ^ i, 

where r® = min k {D), the minimum value of the k th coordinate in the dataset D , and 
7 ~J = ma x. k {D), the maximum value of the k th coordinate in D. Specifically, r® = min x^ 
where = [xn ,..., x^] T , and similarly for 7 ~J. Therefore, denote each vertex oT 1 ^ as 


vertices in a B l - have the z th 
to {ax) 1 . Denote this common coordinate for B ,• as d l - 


v* = 


T \ ,.. •, d^,..., j = 1 .. .c ,4 = 0,1, for k = 1, — , d. 


(44) 


where £ — {^ 1 ,... ,^} 5 an d the total number of vertices becomes 2 d 1 for each subdomain. 

Note that we obtain the number of vertices in each boundary by setting m = 0 in ©■ 

Next, determine the location of control points on each B 1 -. Along each axis, partition the 

boundary into A G N parts. Specifically, define the arrays = {r£, r® + 1 r® + 2..., r^} 

r i_ r o 

for k = l..d, k 7 ^ z, where l k = k A k . Then, each control point has a representation 

CP^ = {a[\a e 2 2 ,... ,dj,... ,a e d d ) VCP^ G B},j = 1... c,£ k = 1,..., A for k = 1,... ,d,(45) 

where £ = {^i,..., denotes the element of the array a^ k . The total number of 

control points in each subdomain is Yli 1 Ot 1 ) = (A + l) d_1 . This is where the exponential 
growth in the number of control points appears. However, experimental results show that 
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the model works well even with A = 3. See Section (4.4) for details. Finally, the total 
number of control points will be 


T(q) = c(A + 1) 


d -1 


(46) 


We can choose the width of each subdomain in two different ways. Using a fixed 
width for each subdomain results in a different number of data points for the subdomains, 
depending on the density of the data points in different parts of the domain, while using 
a different width for each subdomain results in the same number of training data points 
in each subdomain. Note that in the later case, the number of pseudo-inputs can be 
considered as a global parameter. Note, also, that the selection of control points according 


to (45) will benefit from having axes along the direction of maximum variance, because 
there is a better overlap of the control points with the training data. In order to benefit 
from maximum variability condition, we can simply rotate the original axes by using, for 
example, Principal Component Analysis (Jolliffe, 2002), such that the new axes are along 
the direction of maximum variability. 


4 Experimental Results 

In this section we apply SPLK to simulated and real datasets and compare its performance 


with the competing methods, DDM (Park et ah, 2011) and SPGP (Snelson, 2007). We 


selected the former to evaluate the SPLK’s ability to handle non-stationarity and the 
latter to evaluate SPLK’s ability to speed up GP compared with a state-of-the-art sparse 
approximation method. We also present sensitivity analysis regarding the parameters in 
SPLK and propose some guidelines for their selection. 


4.1 Datasets and evaluation criteria 

We implement SPLK in MATLAB and test it on eight different datasets that contain 
both non-stationary and stationary data structures (see Table [I]). The first dataset, TCO 
contains data collected by NIMBUS7/TOMS satellite to measure the total column of ozone 
over the globe on October 1, 1988. This dataset is highly non-stationary and constructed 
on a two-dimensional input space; thus, it is an appropriate dataset for comparing SPLK 
and DDM. 

Recall that one advantage of SPLK over DDM is that SPLK can handle higher dimen¬ 
sional data. Since we could not obtain publicly available high-dimensional large spatial 
datasets, we appeal to both simulated data and modified real data. Specifically, TCO-3D 
is the upgraded version of TCO in a three-dimensional space, we added a new dimension 
to TCO dataset in order to use such a non-stationary and real structure in a higher di¬ 
mensional domain. To add a new dimension to the domain we define a vector z with a size 
equal to the number of training data points, where each element of z is generated from 
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a uniform distribution U[a b\. By attaching this vector to the original input space x, we 
upgrade the domain to a higher dimensional space. 

We also need to update the TCO’s response values, so that each observation is a function 
of both and while the current noisy observations are only a function of the original 
input space, i.e., yi — /(x^) + e. Consequently, we need to manipulate / as well. Note that 
we do not intend to change the non-stationary structure of /; thus, we add a new term to 
/ which can be explained by the new variable Zi, 

y'i = /(xi) + g(zi). (47) 

Since the form of / is unknown, we use the original noisy observations yi to obtain yfi , i.e., 

y'i = yi + g(zi). (48) 


Note that the function g should be defined so that it does not cancel out the non- 
stationary structure of the underlying function /. We satisfy this intention by defin¬ 
ing g under the restriction that \fi\ \gi\. Consider that the new labels have the 
same amount of noise as the original labels, since we do not manipulate and g is 
a noiseless function. We generate the vector z from the uniform distribution between 
[—50,50], and define g as: g(z{) — —2.b\cos(zi)\. Consequently, the new dataset turns 
to D — {([xi,^], 2 /i + g(zi)) : i = 1 , ...,n}. Note that the original labels range between 
[167,540], whereas the range of g [0,2.5] satisfies the stated restriction. 

The second dataset, GPML, is a synthetic dataset generated by a stationary GP gen¬ 
erator using the GPLM-RND function implemented by (Rasmussen and Williams, 2006) 
The third dataset, OzoneL2, is another satellite dataset from NASAj^] and has a two- 
dimensional input space. We upgrade OzoneL2 by adding two synthetic dimensions to 
it along with preserving the non-stationary structure to construct OzoneL2-4D. The fifth 
dataset, Power Plant, is a four-dimensional dataset of data collected over six years (2006- 
2011) from a Combined Cycle Power Plant. Its features consist of hourly average ambient 
variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust 
Vacuum (V) to predict the plant’s net hourly electrical energy output (EP)J^] 

The Syn-3D and Syn-5D datasets are synthetic three-dimensional and five-dimensional 
datasets, respectively, generated from uniform distribution and labeled by a fluctuated 
function with the form of, 


f{x) = exp(|Ep^| c ) cos (T,qiXi) + e, e ~ U[a, 6], 


(49) 


where pi and qi are the weights of dimension in exponential and cosine functions, 
respectively, and e is the label noise. Function (49) is capable of generating a fluctuated data 


1 http: / / www.gaussianprocess.org/gpml 
2 http://disc.sci.gsfc.nasa.gov 
3 http: / / archive, ics.uci.edu/ml 


17 








Dataset 

Type 

Stationarity 

Dimension 

Number of instances 

TCO 

Real 

Non-stationary 

2 

48331 

OzoneL2 

Real 

Non-stationary 

2 

60863 

TCO-3D 

Semi-real 

Non-stationary 

3 

48331 

Syn-3D 

Synthetic 

Non-stationary 

3 

80000 

Power Plant 

Real 

Stationary 

4 

9568 

GPML 

Synthetic 

Stationary 

4 

50000 

OzoneL2-4D 

Semi-real 

Non-stationary 

4 

60863 

Syn-5D 

Real 

Non-stationary 

5 

50000 


Table 1: Characteristics of the eight datasets 


structure with close to non-stationary behavior. We use p — [1,1,—1],# = [0.2,0, 0], c = 
0.3, a = —5, b — 5 for Syn3D dataset, and p = [1, —1,1,1,1], g = [0.2,0, 0, 0, 0], c = 0.3, a = 
—2, b — 2 for Syn5D. 

We use two measures to evaluate the performance of each method. The first one is the 
measure of prediction accuracy, which is assessed by the Mean Squared Error (MSE), 

1 T 

MsE = -y>:- M t) 2 , (50) 

i=i 

where y\ is the noisy observation of the test location x* and \i\ is the value of the mean 
predictor of this test location. We randomly partition each dataset into 90% for training 
and the remaining 10% for testing. The second measure is the training time, which is 
used to evaluate the SPLK’s success in speeding up the GP. Note that the training time is 
not an appropriate measure on its own and the corresponding MSE should be taken into 
account simultaneously, since a reduction in training time without an accurate prediction 
is not useful. 

4.2 Computation time and prediction accuracy 

Here, we compare the training time and the prediction accuracy of the proposed algorithm 
with those of DDM and SPGP. Specifically, we consider the MSE as a function of the 
training time and plot the set of best results for each algorithm. Under this criterion the 
algorithm associated with the curve closest to the origin will be superior. The parameter 
selection for each model is as follows. 

SPLK has three parameters: A as the measure of control points density, N as the 
number of subdomains, and k as the measure of density for local pseudo-inputs. We keep 
the value of A to 3 for the eight datasets, since it is not a critical tuning parameter and 
increasing it does not affect the efficiency of the algorithm (see Section [4~4| ) . Also, for a 
fixed we vary the size of the subdomains for each dataset. Figure [3] shows the values of 
N and k. As N increases, the computation time of the algorithm improves, so the points 
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with smaller times belong to larger values of N. For the derivations of MSE for different 
N see Section 14.3.11 

DDM has two parameters: A as the number of control points on each boundary and 
N as the number of subdomains. The role of A in DDM is the same as its role in SPLK 
for two-dimensional spaces, and similarly the algorithm works efficiently with A equal to 
3; thus, for the sake of comparison with SPLK, we also set parameter A at 3 for DDM. 
We vary N in the range of {20,40,88,156,314,480,759} for the specific dataset TCO, 
Figure |3a| As it is expected for smaller values of N , i.e., larger subdomains, the efficiency 
of the algorithm deteriorates in terms of time; therefore, points with higher computation 
time belong to smaller values of N. On the other hand, the trend of change in the MSE 
may vary for different datasets; however, the MSE remains stable for the two-dimnesional 
datasets used in this study. 

The only tuning parameter for SPGP is the number of pseudo-inputs, m. As m in¬ 
creases, the computation time increases, while the MSE either decreases such as Figure |3a| 
or fluctuates around a specific value such as Figure [3d] depending on the data structure. 
We vary m in the range of {8,16,32,64,128,256} for the eight datasets. Note that for 
some datasets such as TCO and Syn-5D, Figures |3a| |3f| the value of the MSE is too large 
for small m and therefore we cut off the curve in order to be comparable to the other plots. 

For the dataset TCO, Figure 3a shows that both SPLK and DDM work faster and 
more accurately than SPGP, which is not surprising, since sparse approximation methods 
alone cannot fully address non-stationarity. Note that both SPLK and DDM are similar 
in training time, due to the lower number of control points in the two-dimensional space. 
DDM, however, works better than SPLK, because it can fit a full GP in each subdomain. 

We are unable to compare the other datasets with DDM, because DDM’s implementa¬ 
tion is restricted for one- or two-dimensional spaces. Figures |3bl |3c| and |3f| show that SPLK 
is more efficient for non-stationary structures than SPGP, whereas there is no significant 
difference between SPLK and SPGP for the stationary datasets. However, it seems that 
SPGP is more efficient for stationary data structures, because it has less computational 
burden and approximation than SPLK. 
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Figure 3: MSE versus Training Time for SPLK, SPGP and DDM. A = 3 for both DDM and 
SPLK. rn for SPGP e {8,16,32,64,128,256}, N for DDM £ {20,40,88,156,314,480, 759} 



















4.3 Sensitivity analysis 

We present some guidelines for selecting the parameters in SPLK and show how their 


selection affects SPLK’s performance. Specifically, Section |4.3.1| discusses the number of 
cuts for a data domain which equivalently means the number of data points that fall in 
each subdomain. Section 4.3.1| also discusses some sensitivity analysis regarding the choice 
of k in (43), which determines the number of pseudo-inputs. Section 4.4 discusses the effect 


of the number of control points that we put on each boundary. 

4.3.1 Number of cuts and pseudo-inputs 

Flexibility in choosing the size of subdomains gives SPLK more freedom to modify the size 
to obtain the desired efficiency. Based on the experiments in this study, we suggest that 
each subdomain should have 500-5000 data points. Selecting less than 500 data points for 
each subdomain creates a large number of subdomains which in turn can reduce SPLK’s 
prediction accuracy, whereas more than 5000 points becomes computationally burdensome 
. Below, we discuss the trade-off between accuracy and computational efficiency and other 
factors that can affect deciding the sizes of the subdomains and the number of cuts. 

Along with the number of subdomains TV, we need to consider the data structure and 
the number of local pseudo-inputs rrij. In the current model, rrij is a function of the non- 
stationarity coefficient fc; consequently, we consider k instead of rrij in our analysis. To 
illustrate the effect of the number of cuts on the algorithm’s efficiency, we plot the MSE 
and training time of SPLK as functions of TV for varying k. Each curve in Figures [4j [5] and 
[6] shows the MSE or training time for varying TV for a fixed k. 

The trend of changes in accuracy of the algorithm with varying number of subdomains 
differs depend on data structure. For stationary structures, the number of cuts has in¬ 
significant effect on prediction accuracy; consequently, we prefer a meshing policy with 
more cuts, since it reduces the training time while keeps the precision fixed. Figures 4a 
and [4b] show the trend of changes in the MSE for the GPML and Power Plant datasets; 
these figures do not suggest a monotonous trend in MSE for varying TV and whereas 
their training time decreases as the number of cuts increases. 

On the other hand, non-stationary data structures behave in two different ways. For 
the datasets like OzoneL2-4D and Syn-5D, as the sizes of subdomains increases, SPLK 
yields more precise prediction and this precision converges to the best result, which can be 
obtained through the algorithm regardless of the number of utilized local pseudo-inputs. 
Figure [Fb| and |5a| show that every curve converges to the same precision regardless of the 
value of k. However, with a larger k the convergence rate is faster, and as a result SPLK 
reaches its lower bound MSE with more subdomains. Recall that in terms of computational 
time, it is more efficient to use smaller subdomains. Therefore, we suggest utilizing a 
sufficiently large k in smaller subdomains, since it reduces the MSE. Note that if k is larger 
than some threshold, the training time increases without improving the MSE. Figure |5a| 
shows that the algorithm starts obtaining its best result from TV = 40 with k — 1.5 and 2, 
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whereas the result is not achievable for k = 1 and 0.5, even with 10 cuts. Figure [5c| shows 
the algorithm’s computational efficiency. Figure |5b| represents the above mentioned fact 
for OzoneL2-4D, which for k = 2.5 and 3, we obtain good results with N = 20, whereas 
we need to lessen the cuts in order to obtain the same result for smaller values of k. 

For TCO and TCO-3D, the experiment results in convex functions with a minimum 
value. Here, the MSE does not converge to a lower bound as N decreases, but for different 
values of k we observe that both the functions, which minimize the MSE in a particular 
range of cuts, and the minimum values become smaller for the larger number of local 
pseudo-inputs. Figure [6] shows that when N = 40, the best result is obtained for each 
curve and that it improves as k increases. 


It is also informative to illustrate the impact of A; in (43) on the algiorithm’s accuracy 


and efficiency. Recall mj = k^/rij, where nrij and rij are the number of pseudo-data points 
and actual data points in domain j, respectively. We vary the value of k for five different 
datasets with different levels of stationarity, keeping rij fixed, to measure the MSE and 
the training time. Figure [7a] indicates that for stationary datasets such as GPML and 


PowerPlant, the varying values for k results in the same MSE, whereas for the other non¬ 
stationary datasets, the predictive accuracy depends on k. Moreover, Figure [7b| shows the 
training time increases monotonically by the growth of k regardless of the changes in the 
MSE. 


4.4 Control points density 

Control points density in the boundaries of subdomains is another tuning parameter that 


we need to determine in SPLK. As discussed in Section [34| we divide a boundary hyper¬ 
rectangle into a number of folds from each dimension of the space; parameter A signals the 
control points density on each boundary hyper-rectangle. Consequently, as A increases, the 
folds become smaller and more control points are required to cover the whole boundary. 
Figure [8] shows the location of control points on one-, two- and three-dimensional boundary 
hyper-rectangles with parameter A = 3. 

Surprisingly, a uniform distribution of the control points works well with small values of 
A. Increasing the number of control points does not improve the prediction accuracy, but it 
does increase the computational burden and as a result the training time. To demonstrate 
the effect of the control points density on the efficiency of the algorithm, we present the 
results of testing SPLK on four datasets with varying values of A with all other parameters 
fixed. Table [2] indicates that an increase in the value of A does not improve the accuracy 
for the four datasets, and there is an insignificant effect on the training times for the two-, 
three- and four-dimensional spaces. The training time does increase for the dimensional 
spaces greater than four. This is not surprising, since for high dimensional spaces {d > 4) 
the expansion rate of the space is high and just adding one more fold to each side of 
the boundary hyper-rectangles makes the model locate many more control points, thus 
adding to the computational burden. We conclude that locating many control points for 
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Dataset 

A 

Time 

MSE 


3 

39.62 

23.98 

TCO 

4 

40.32 

24.03 


5 

40.52 

23.20 


3 

42.36 

6.81 

Syn-3D 

4 

40.43 

6.59 


5 

50.48 

6.79 


3 

50.42 

34.36 

OzoneL2-4D 

4 

48.65 

35.17 


5 

51.54 

32.69 


3 

39.80 

14.11 

Syn-5D 

4 

62.96 

15.04 


5 

188.14 

15.03 


Table 2: Effect of A on efficiency of SPLK 


dimensional spaces greater than four has little benefit. 

5 Summary 

This paper proposed a Sparse Pseudo-input Local Kriging (SPLK) approximation method 
for GP regression. GP regression is a powerful tool in the analysis of spatial systems, 
but it does not scale efficiently to large datasets. In addition, many spatial datasets have 
highly heterogeneous covariance structures which cannot be modeled effectively with a sin¬ 
gle covariance function. SPLK simultaneously addressed scalability and heterogeneity by 
partitioning the data’s domain into subdomains. The partitioning used orthogonal cuts to 
fit a sparse GP to the data in each subdomain. As a result, SPLK allowed selection of the 
partitions having larger numbers of data points. SPLK also guaranteed the continuity of 
the overall prediction surface by putting a few control points on the boundary of neighbor¬ 
ing partitions. The orthogonal cuts, the selection procedure for the control points, and the 
sparse schemes allowed application of SPLK to spatial datasets with higher dimensions, 
i.e., data that included exogenous variables in addition to the location information. SPLK 
was applied to eight real and synthetic datasets. The results showed that SPLK main¬ 
tains a good balance between the prediction accuracy and computation time for data with 
stationary or non-stationary covariance structures. 

The limitations of SPLK can be better understood in a more comprehensive study that 
uses a larger number of real spatial datasets with exogenous variables. Based on the cur¬ 
rent results, we suggest two paths for future research. First, instead of orthogonal cuts, 
we could utilize non-orthogonal cuts, i.e., cutting the domain with hyperplanes that are 
not parallel to any axes. This could allow more flexibility in choosing the subdomains. 
Specifically, if a cluster of data points exhibits a specific behavior that can be captured 
using a covariance function, then a non-orthogonal cut can more effectively isolate those 
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data points. Second, we could select subdomains such that in some of them instead of 
a sparse approximation, we could use a full GP. This is contingent upon identifying ho¬ 
mogenous domains, or equivalently cuts, that contain a small number of data points, to 
which we can apply a full GP efficiently. A combination of both tasks, non-orthogonal cuts 
and application of a full GP in some domains, would provide an adaptive method which 
can efficiently predict the response surface. These tasks require more data analysis and 
algorithmic development which will be an ongoing pursuit. 
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Figure 4: Prediction accuracy and training time for varying number of subdomains for 
stationary datasets. Figures 4a and 4c use the PowerPlant dataset; Figures 4b and 4d use 
the GPML dataset. Each function is associated with a particular fc, where k = m/yjn , 
n = D/N. 
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(a) Syn-5D (MSE) 
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Figure 5: Prediction accuracy and taring time for varying numbers of subdomains for 
non-stationary datasets with convergence behavior. Figures 5a and 5c use the Syn-5D 
dataset; Figure 5b and 5d use the OzoneL2-4D dataset. Each function is associated with 
a particular /c, where k — m/y/n, n — D/N. 
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(a) TCO (MSE) 


(b) TCO-3D (MSE) 


Figure 6: Prediction accuracy for varying number of subdomains for non-stationary 
datasets with minimum function behavior.Figure 6a uses the TCO dataset; Figure 6b 
uses the TCO-3D dataset. Each colorful function is associated with a particular fc, where 
k = m/y/n, n = D/N. 



(a) k vs MSE (b) k vs Time 


Figure 7: Efficiency of the model with different choices of k 
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Figure 8: Control point distribuion on one-, two-, three- dimensional hyper-rectangles with 
A = 3 
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